Exploratory Data Analysis of Tropical Storms in R

The disastrous impact of recent hurricanes Harvey and Irma generated a large influx of data within the online community. I was curious about the history of hurricanes and tropical storms, so I found a data set on data.world and started some basic exploratory data analysis (EDA).

EDA is crucial to starting any project. Through EDA you can start to identify errors and inconsistencies in your data, find interesting patterns, see correlations, and begin to develop hypotheses to test. For most people, basic spreadsheets and charts are pretty handy and provide a great place to start. They are an easy-to-use way to manipulate and visualize your data quickly. Data scientists may cringe at the idea of using a graphical user interface (GUI) to kick off the EDA process, but the reality is clear: those tools are very effective and efficient when used properly. However, if you’re reading this, you’re probably trying to take EDA to the next level. The best way to learn is to get your hands dirty, so let’s get started.

The original source of the data can be found at DHS.gov.


Step 1: Take a look at your data set and see how it is laid out

FID YEAR MONTH DAY AD_TIME BTID NAME LAT LONG WIND_KTS PRESSURE CAT BASIN Shape_Leng
2001 1957 8 8 1800Z 63 NOTNAMED 22.5 -140.0 50 0 TS Eastern Pacific 1.140175
2002 1961 10 3 1200Z 116 PAULINE 22.1 -140.2 45 0 TS Eastern Pacific 1.166190
2003 1962 8 29 0600Z 124 C 18.0 -140.0 45 0 TS Eastern Pacific 2.102380
2004 1967 7 14 0600Z 168 DENISE 16.6 -139.5 45 0 TS Eastern Pacific 2.121320
2005 1972 8 16 1200Z 251 DIANA 18.5 -139.8 70 0 H1 Eastern Pacific 1.702939
2006 1976 7 22 0000Z 312 DIANA 18.6 -139.8 30 0 TD Eastern Pacific 1.600000
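The preview above comes from base R's `read.csv()` and `head()`. A minimal sketch follows; the filename in the commented line is an assumption (use whatever name your data.world export has), so the runnable part reads the same column layout from an inline string instead:

```r
# Real load would look like this (hypothetical filename):
# storms <- read.csv("Historical_Tropical_Storm_Tracks.csv", stringsAsFactors = FALSE)

# Self-contained demo: read the first two rows of the table from an inline string
csv <- "FID,YEAR,MONTH,DAY,AD_TIME,BTID,NAME,LAT,LONG,WIND_KTS,PRESSURE,CAT,BASIN,Shape_Leng
2001,1957,8,8,1800Z,63,NOTNAMED,22.5,-140.0,50,0,TS,Eastern Pacific,1.140175
2002,1961,10,3,1200Z,116,PAULINE,22.1,-140.2,45,0,TS,Eastern Pacific,1.166190"
storms <- read.csv(text = csv, stringsAsFactors = FALSE)
head(storms)
```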

Fortunately, this is a tidy data set that appears to have been cleaned up substantially, which will make life easier. The column names are relatively straightforward, with the exception of the “ID” columns (FID and BTID).

The description as given by DHS.gov:

This dataset represents Historical North Atlantic and Eastern North Pacific Tropical Cyclone Tracks with 6-hourly (0000, 0600, 1200, 1800 UTC) center locations and intensities for all subtropical depressions and storms, extratropical storms, tropical lows, waves, disturbances, depressions and storms, and all hurricanes, from 1851 through 2008. These data are intended for geographic display and analysis at the national level, and for large regional areas. The data should be displayed and analyzed at scales appropriate for 1:2,000,000-scale data.

Step 2: View some descriptive statistics

YEAR MONTH DAY WIND_KTS PRESSURE
Min. :1851 Min. : 1.000 Min. : 1.00 Min. : 10.00 Min. : 0.0
1st Qu.:1928 1st Qu.: 8.000 1st Qu.: 8.00 1st Qu.: 35.00 1st Qu.: 0.0
Median :1970 Median : 9.000 Median :16.00 Median : 50.00 Median : 0.0
Mean :1957 Mean : 8.541 Mean :15.87 Mean : 54.73 Mean : 372.3
3rd Qu.:1991 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.: 70.00 3rd Qu.: 990.0
Max. :2008 Max. :12.000 Max. :31.00 Max. :165.00 Max. :1024.0
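The table above is the output of base R's `summary()` applied to the numeric columns. A self-contained sketch, with a toy frame standing in for the full data set:

```r
# Toy stand-in for the real data; summary() prints the same six-number layout
storms <- data.frame(YEAR = c(1957, 1961, 1962),
                     WIND_KTS = c(50, 45, 45),
                     PRESSURE = c(0, 0, 0))
summary(storms[, c("YEAR", "WIND_KTS", "PRESSURE")])
```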

We can confirm that this particular data set covers storms from 1851 through 2008; that means the data goes back roughly 100 years before the naming of storms began! We can also see that the minimum pressure values are 0, which likely means the pressure could not be measured, since zero pressure is not physically possible in this case. We can see recorded months ranging from January to December, with days extending from 1 to 31. Whenever you see all of the dates laid out that way, you can smile and think to yourself, “if I need to, I can put dates in an easy-to-use format such as YYYY-mm-dd (2017-09-12)!”
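That date reshaping can be sketched in base R; a toy frame with the YEAR, MONTH and DAY columns stands in for the real data here:

```r
# Build a proper Date column from the separate YEAR/MONTH/DAY columns
storms <- data.frame(YEAR = c(1957, 1961), MONTH = c(8, 10), DAY = c(8, 3))
storms$DATE <- as.Date(sprintf("%04d-%02d-%02d",
                               storms$YEAR, storms$MONTH, storms$DAY))
storms$DATE
```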

Step 3: Make a basic plot

This is a great illustration of our data set, and we can easily notice an upward trend in the number of storms over time. Before we go running to tell the world that the number of storms per year is growing, we need to drill down a bit deeper. The trend could simply be caused by more types of storms being added to the data set over time (we know there are hurricanes, tropical storms, waves, etc.). However, we should keep it in mind when we start to develop hypotheses.

You will notice the data starts at 1950 rather than 1851. I made this choice because storms were not named until that point, so it would be difficult to count the unique storms per year. It could likely be done by finding a way to utilize the “ID” columns, but this is a preliminary analysis, so I didn’t want to dig too deep.
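Counting distinct named storms per year can be sketched in base R with `tapply()` (the post's actual code may well have used dplyr instead; a hand-made sample stands in for the real data):

```r
# Toy sample: three storm observations in 1951, two in 1952
storms <- data.frame(
  YEAR = c(1951, 1951, 1951, 1952, 1952),
  NAME = c("ABLE", "ABLE", "BAKER", "ABLE", "ABLE")
)
# Keep the named-storm era, then count unique names within each year
named <- storms[storms$YEAR >= 1950, ]
distinct_per_year <- tapply(named$NAME, named$YEAR,
                            function(x) length(unique(x)))
distinct_per_year
```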

Step 4: Create data from within the plot

YEAR 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
Distinct_Storms 10 6 8 8 10 7 9 10 13 13 20 12 15 14 15 24 25 26 23 26 30 21 20 28 25 24 14 30 18 25 27 28 25 33 34 23 26 26 28 35 21 33 23 27 29 21 27 27 21 34 30 27 32 27 43 28 26 33
Distinct_Storms_Change -3 -4 2 0 2 -3 2 1 3 0 7 -8 3 -1 1 9 1 1 -3 3 4 -9 -1 8 -3 -1 -10 16 -12 7 2 1 -3 8 1 -11 3 0 2 7 -14 12 -10 4 2 -8 6 0 -6 13 -4 -3 5 -5 16 -15 -2 7
Distinct_Storms_Pct_Change -0.23 -0.40 0.33 0.00 0.25 -0.30 0.29 0.11 0.30 0.00 0.54 -0.40 0.25 -0.07 0.07 0.60 0.04 0.04 -0.12 0.13 0.15 -0.30 -0.05 0.40 -0.11 -0.04 -0.42 1.14 -0.40 0.39 0.08 0.04 -0.11 0.32 0.03 -0.32 0.13 0.00 0.08 0.25 -0.40 0.57 -0.30 0.17 0.07 -0.28 0.29 0.00 -0.22 0.62 -0.12 -0.10 0.19 -0.16 0.59 -0.35 -0.07 0.27
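The year-over-year change and percent change rows can be derived from the counts with base R's `diff()`. A minimal sketch, using the first five counts (1951–1955) from the table above:

```r
# Distinct storm counts for 1951-1955, taken from the table above
distinct_storms <- c(10, 6, 8, 8, 10)
# Absolute change vs the previous year (NA for the first year shown)
change <- c(NA, diff(distinct_storms))
# Percent change: change divided by the previous year's count
pct_change <- round(change / c(NA, head(distinct_storms, -1)), 2)
change
pct_change
```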
Distinct_Storms Distinct_Storms_Change Distinct_Storms_Pct_Change
Min. : 6.00 Min. :-15.0000 Min. :-0.42000
1st Qu.:15.75 1st Qu.: -3.0000 1st Qu.:-0.12000
Median :25.00 Median : 1.0000 Median : 0.04000
Mean :22.81 Mean : 0.3448 Mean : 0.05966
3rd Qu.:28.00 3rd Qu.: 3.7500 3rd Qu.: 0.25000
Max. :43.00 Max. : 16.0000 Max. : 1.14000

## # A tibble: 257 x 4
## # Groups:   CAT [5]
##       CAT  YEAR Distinct_Storms Distinct_Storms_Pct_Change
##    <fctr> <int>           <int>                      <dbl>
##  1     H1  1950              12                         NA
##  2     H1  1951               8                      -0.33
##  3     H1  1952               6                      -0.25
##  4     H1  1953               6                       0.00
##  5     H1  1954               6                       0.00
##  6     H1  1955               9                       0.50
##  7     H1  1956               4                      -0.56
##  8     H1  1957               6                       0.50
##  9     H1  1958               7                       0.17
## 10     H1  1959               8                       0.14
## # ... with 247 more rows
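The grouped tibble above (the output format suggests dplyr's `group_by()`) can be approximated in base R with `aggregate()`; toy data stands in for the real set here:

```r
# Toy sample: storms tagged with category, year and name
storms <- data.frame(
  CAT  = c("H1", "H1", "H1", "TS", "TS"),
  YEAR = c(1950, 1950, 1951, 1950, 1951),
  NAME = c("ABLE", "BAKER", "ABLE", "CHARLIE", "CHARLIE")
)
# Count unique storm names within each CAT/YEAR group
by_cat <- aggregate(NAME ~ CAT + YEAR, data = storms,
                    FUN = function(x) length(unique(x)))
names(by_cat)[3] <- "Distinct_Storms"
by_cat
```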